update proposal for maxUnavailable for statefulsets #1010

Merged · 1 commit merged on Sep 12, 2019
Conversation

krmayankk

/sig apps

@k8s-ci-robot k8s-ci-robot added sig/apps Categorizes an issue or PR as relevant to SIG Apps. cncf-cla: yes Indicates the PR's author has signed the CNCF CLA. kind/kep Categorizes KEP tracking issues and PRs modifying the KEP directory size/M Denotes a PR that changes 30-99 lines, ignoring generated files. labels Apr 27, 2019
@krmayankk
Author

/assign @bgrant0607 @kow3ns @janetkuo @Kargakis

@justaugustus
Member

/assign @bgrant0607 @kow3ns @janetkuo @Kargakis

@krmayankk krmayankk changed the title update maxUnavailable for statefulsets update proposal for maxUnavailable for statefulsets Apr 29, 2019
@krmayankk krmayankk mentioned this pull request Apr 29, 2019
@krmayankk
Author

FYI @furykerry @resouer @FillZpp

@FillZpp

FillZpp commented Apr 30, 2019

Great, the description of the choices is very clear. Also, I still recommend Choice 3; it keeps a balance between faster updates and the ordering guarantee.

The problem with Choice 1 is that it looks simple, but actually is not. Let me walk through an example:

  1. First the StatefulSet with replicas 4 has pods [0-3]; then we update the template and set maxUnavailable to 2.
  2. The StatefulSet controller starts to update pods 2 and 3. With Choice 1, the controller will update the remaining pods only once pods 2 and 3 are ready. But if pod 3 becomes ready first, the currently unavailable pod is 2; how does the controller know it should wait for 2 to be ready before updating the remaining pods?
  3. You may say the controller calculates the pod update groups from the maxUnavailable number, like [{0, 1}, {2, 3}]. But if replicas is changed to 5 while pods 2 and 3 are updating, the controller will split the pods into [{0}, {1, 2}, {3, 4}].
  4. Then you will find that pods 2 and 3, which are in different groups, are updating together. Surprise! (See the sketch below.)
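
To make point 3 concrete, here is a minimal sketch (a hypothetical helper, not actual StatefulSet controller code) showing how fixed-size update groups derived from maxUnavailable shift when replicas changes from 4 to 5 mid-update:

```go
package main

import "fmt"

// updateGroups splits pod ordinals [0..replicas-1] into groups of at most
// maxUnavailable, walking from the highest ordinal down, the way a naive
// Choice 1 implementation might.
func updateGroups(replicas, maxUnavailable int) [][]int {
	var groups [][]int
	for high := replicas - 1; high >= 0; high -= maxUnavailable {
		low := high - maxUnavailable + 1
		if low < 0 {
			low = 0
		}
		group := []int{}
		for i := low; i <= high; i++ {
			group = append(group, i)
		}
		// Prepend so the groups read low-to-high.
		groups = append([][]int{group}, groups...)
	}
	return groups
}

func main() {
	fmt.Println(updateGroups(4, 2)) // [[0 1] [2 3]]
	fmt.Println(updateGroups(5, 2)) // [[0] [1 2] [3 4]] -- pods 2 and 3 now fall into different groups
}
```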

@krmayankk
Author

You may say the controller calculates the pod update groups from the maxUnavailable number, like [{0, 1}, {2, 3}]. But if replicas is changed to 5 while pods 2 and 3 are updating, the controller will split the pods into [{0}, {1, 2}, {3, 4}].

Is this even possible? I believe the replicas cannot be scaled to 5 before the update finishes. The update knows to start from the highest ordinal and take maxUnavailable pods at a time.

Note the goal is faster rolling updates. For faster rolling updates, we could potentially provide two modes: one with ordering and one where ordering within the maxUnavailable group is not guaranteed. I don't think we have enough data to say we need both modes. We can start with the option that makes the most sense for faster updates. The consumer of faster updates would know the ordering is not guaranteed when using this feature, and that is a conscious choice the user will make.

@FillZpp

FillZpp commented May 3, 2019

Is this even possible? I believe the replicas cannot be scaled to 5 before the update finishes.

Actually, it can. You may test this case in your Kubernetes cluster or minikube, for example.

So as I said before, a controller implementing Choice 1 will be confused when the user modifies replicas during an update.

with ordinal 2 will start Terminating. Pods with ordinal 0 and 1 will remain untouched due the partition. In this choice, the number of pods
terminating is not always maxUnavailable, but sometimes less than that. For e.g. if pod with ordinal 3 is running and ready but 4 is not, we
still wait for 4 to be running and ready before moving on to 2. This implementation avoids out of order Terminations of pods.
2: Pods with ordinal 4 and 3 will start Terminating at the same time(because of maxUnavailable). When any of 4 or 3 are running and ready, pods
Contributor

The three alternatives are getting merged into one paragraph; needs an extra newline between each alternative.

Author

add a newline

#### Implementation

TBD: Will be updated after we have agreed on the semantics beign discussed above.
Member

being


I recommend Choice 1, simply because its easy to think and reason about.

Other choices are introduce a flag, which controls whether we want maxUnavailable with ordering or without ordering. I dont think we need that
Member

List this as Choice 4?

@krmayankk
Author

Is this even possible? I believe the replicas cannot be scaled to 5 before the update finishes.

Actually, it can. You may test this case in your Kubernetes cluster or minikube, for example.

So as I said before, a controller implementing Choice 1 will be confused when the user modifies replicas during an update.

We can specify in the documentation what the behavior will be. Note that changing the grouping is not a big deal. The main goal is faster updates; whether a pod gets updated in one grouping or another doesn't matter, unless you have a good reason that it matters.
Happy to hear others' opinions.

@krmayankk
Author

Adding more people to see if they have opinions on the choices presented here
@smarterclayton @thockin @brendandburns @kubernetes/sig-apps-feature-requests

@k8s-ci-robot k8s-ci-robot added the kind/feature Categorizes issue or PR as related to a new feature. label Jun 18, 2019
about. They are used by DaemonSet and StatefulSet for tracking revisions. It would be so much nicer if all the use cases of Deployments can be met and we
could track the revisions by ControllerRevisions.
4. Sometimes I just want easier tracking of revisions of a rolling update. Deployment does it through ReplicaSets and has its own nuances. Understanding
that requires diving into the complicacy of hashing and how ReplicaSets are named. Over and above that, there are some issues with hash collisions which
Contributor

I feel like this point should be made clear in feedback to @kubernetes/sig-apps-api-reviews - are there any supporting statements we should have here about changes that should happen in deployments?

@k8s-ci-robot k8s-ci-robot added the kind/api-change Categorizes issue or PR as related to adding, removing, or otherwise changing an API label Jun 18, 2019
that requires diving into the complicacy of hashing and how ReplicaSets are named. Over and above that, there are some issues with hash collisions which
further complicate the situation(I know they were solved). StatefulSet introduced ControllerRevisions in 1.7 which are much easier to think and reason
about. They are used by DaemonSet and StatefulSet for tracking revisions. It would be so much nicer if all the use cases of Deployments can be met in
StatefulSet's and additionally we could track the revisions by ControllerRevisions.
Contributor

Hrm - that sounds like a dangerous statement. Are you saying stateless deployments should be implemented by StatefulSets?

Author

@smarterclayton I will reword this. What I am trying to say is that sometimes people just want the one extra thing that StatefulSets provide, which is stable identity, without the storage part, while everything else is exactly like Deployments. They stick with Deployments because StatefulSet doesn't have maxUnavailable. If StatefulSet had maxUnavailable, they could move to it and additionally take advantage of better tracking of versions.

In general I do feel that the distinction between StatefulSet and Deployment is blurring. The only advantage I can think of at this point for using Deployments over StatefulSets is that Deployments can maxSurge while StatefulSets cannot because of their identity. But making them one object is a different discussion.


### Goals
StatefulSet RollingUpdate strategy will contain an additional parameter called `maxUnavailable` to control how many Pods will be brought down at a time,
during Rolling Update.

### Non-Goals
maxUnavailable is only implemeted to affect the Rolling Update of StatefulSet. Considering maxUnavailable for Pod Management Policy of Parallel is beyond
maxUnavailable is only implemented to affect the Rolling Update of StatefulSet. Considering maxUnavailable for Pod Management Policy of Parallel is beyond
Contributor

Why?

Author

@smarterclayton Because podManagementPolicy is only applicable during StatefulSet creation, deletion, and scaling, while the main use case here is faster rolling updates. We could make maxUnavailable apply to create/delete/scale as well, but that is secondary to the goal. I thought we could limit the scope of the initial implementation to make it less confusing, but if you think we should apply this to both, that is fine as well.

Contributor

So during an upgrade, if 3 of 5 nodes are evacuated, PMP is ignored? Since this is mostly established behavior of ReplicaSets and Deployments, you should mention why it's out of scope. I am more concerned with creating something that differs between RS/Deployment and StatefulSet right now than I am about implementation ordering, so I just want the KEP to reflect that it is a) consistent with user expectations and b) justifies statements about timing with reasons.

@fejta-bot

This PR may require API review.

If so, when the changes are ready, complete the pre-review checklist and request an API review.

Status of requested reviews is tracked in the API Review project.
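
For readers following the API discussion, here is a rough sketch of where such a field could live, assuming it mirrors the maxUnavailable field on Deployment's RollingUpdate strategy (the exact shape and defaulting are for the KEP and API review to settle):

```go
package apps

import "k8s.io/apimachinery/pkg/util/intstr"

// Sketch only; field placement assumed, not the merged API.
type RollingUpdateStatefulSetStrategy struct {
	// Partition exists today and is unchanged.
	Partition *int32 `json:"partition,omitempty"`
	// MaxUnavailable is the proposed addition: the maximum number of pods
	// that can be unavailable during a rolling update, as an absolute
	// number or a percentage of replicas.
	MaxUnavailable *intstr.IntOrString `json:"maxUnavailable,omitempty"`
}
```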

this time, only 1 is unavailable although we requested 2.
are the possible behavior choices we have:-

1. Pods with ordinal 4 and 3 will start Terminating at the same time(because of maxUnavailable). Once they are both running and ready, pods
Author

@smarterclayton do you have any suggestions/opinions on which of these semantics makes more sense when implementing maxUnavailable?

@k8s-ci-robot k8s-ci-robot added size/L Denotes a PR that changes 100-499 lines, ignoring generated files. and removed size/M Denotes a PR that changes 30-99 lines, ignoring generated files. labels Jul 18, 2019
@krmayankk
Author

@smarterclayton @kow3ns Based on discussion in the SIG Apps meeting and offline discussion with Clayton, I have updated the proposal. Please take a look and provide feedback.


##### Recommended Choice

I recommend Choice 4, using PMP=Parallel for the first Alpha Phase. This would give the users fast
Member

In particular, I think users will want the behavior of (1) above if they are using Ordered pod management and the behavior of (2) if they are using Parallel. Consider implementing just these semantics. That is, if the PMP is Ordered, the user is declaring that they care about termination ordering, even if they are willing to tolerate a larger number of disruptions during an update. If the PMP is Parallel, the user does not care about the termination ordering during turn-up/turn-down. If they do care about the termination ordering during an update, they can set the maxUnavailable field to 1 to preserve the current behavior. If they wish to tolerate a larger number of disruptions, they can increase its value.
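
The two behaviors described above can be summarized in a small sketch (simplified decision logic with hypothetical names, not the real StatefulSet controller):

```go
package main

import "fmt"

// nextToTerminate sketches the semantics described above. ordinals lists the
// not-yet-updated pod ordinals, highest first (the order a RollingUpdate walks
// them); notReady is how many already-updated pods are still coming back up.
func nextToTerminate(pmp string, ordinals []int, notReady, maxUnavailable int) []int {
	// Ordered pod management, behavior (1): never start a new batch until the
	// previous batch is fully ready, so terminations stay in ordinal order.
	if pmp == "OrderedReady" && notReady > 0 {
		return nil
	}
	// Parallel pod management, behavior (2): refill the unavailability budget
	// as soon as any updated pod becomes ready, with no ordering guarantee
	// inside the budget.
	budget := maxUnavailable - notReady
	if budget <= 0 {
		return nil
	}
	if budget > len(ordinals) {
		budget = len(ordinals)
	}
	return ordinals[:budget]
}

func main() {
	// Pods 4 and 3 were just updated; 3 is ready again, 4 is not.
	remaining := []int{2, 1, 0}
	fmt.Println(nextToTerminate("OrderedReady", remaining, 1, 2)) // []  -- wait for pod 4
	fmt.Println(nextToTerminate("Parallel", remaining, 1, 2))     // [2] -- take the next pod now
}
```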

@krmayankk
Author

@kow3ns can you approve the changes to mark it implementable so that I can get an Alpha in 1.16?

@praseodym
Contributor

For reference: the Kruise third-party project provides an advanced StatefulSet implementation with maxUnavailable and more.

@FillZpp

FillZpp commented Aug 19, 2019

I'm glad to see OpenKruise has been noted in this KEP. @praseodym

Kruise is a set of controllers that extends and complements the Kubernetes core controllers for workload management, and its AdvancedStatefulSet provides features like maxUnavailable, inPlaceUpdate, and so on.

@kow3ns
Member

kow3ns commented Aug 19, 2019

/approve

@k8s-ci-robot k8s-ci-robot added the approved Indicates a PR has been approved by an approver from all required OWNERS files. label Aug 19, 2019
@krmayankk
Author

@janetkuo @kow3ns can one of you also lgtm this?

@k8s-ci-robot
Contributor

@krmayankk: you cannot LGTM your own PR.

In response to this:

/lgtm

Instructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the kubernetes/test-infra repository.

@kow3ns
Member

kow3ns commented Sep 12, 2019

/lgtm

@k8s-ci-robot k8s-ci-robot added the lgtm "Looks good to me", indicates that a PR is ready to be merged. label Sep 12, 2019
@k8s-ci-robot
Contributor

[APPROVALNOTIFIER] This PR is APPROVED

This pull-request has been approved by: kow3ns, krmayankk

The full list of commands accepted by this bot can be found here.

The pull request process is described here

Needs approval from an approver in each of these files:

Approvers can indicate their approval by writing /approve in a comment
Approvers can cancel approval by writing /approve cancel in a comment

@kow3ns
Member

kow3ns commented Sep 12, 2019

I'm generally not in favor of both approving and lgtm'ing something. However, we've discussed this in the SIG multiple times over a period of more than a month.

@k8s-ci-robot k8s-ci-robot merged commit 8e70bb6 into kubernetes:master Sep 12, 2019
@k8s-ci-robot k8s-ci-robot added this to the v1.17 milestone Sep 12, 2019
@krmayankk
Author

Thanks @kow3ns. If there are any follow-up comments or suggestions from other reviewers, I am happy to make follow-up PRs while I make progress on the implementation.

@kerthcet
Member

Kindly ping @krmayankk. Are we going to graduate this feature to beta in v1.26, or do we have any plans?
